The following packages are used in this project:
library(tidyverse) # For data wrangling
library(readxl) # To import excel and csv files
library(countrycode) # To assign the names of the continents for each country
library(readr) # To export the dataframe
library(ggthemes) # Additional themes for graphics
library(plotly) # To make interactive plots
library(gganimate) # To create animated plots
In this project, I will recreate three single year plots for the years 1800, 1950 and 2018 (latest available data) as shown in Hans Rosling’s video here:
life expectancy and wealth, 2009
Additionally, I am going to create an animated plot of the last 50 years of data using gganimate. Finally, I will create an interactive plot with 2018 data using plotly.
The files used for this project, downloaded from https://www.gapminder.org/data/, can be found in the “data” folder of this project.
There are CSV files for the following:
Income: “data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv”
Life expectancy (years): “data/life_expectancy_years.csv”
Population: “data/population_total.csv”
Importing the 3 CSV datasets for processing, first the income or GDP per capita, then the life expectancy and finally the population.
# Importing the datasets and assigning shorter names
# check.names is used so the name of the variable does not change after importing.
inc <- read.csv("data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", check.names = FALSE)
lif <- read.csv("data/life_expectancy_years.csv", check.names = FALSE)
pop <- read.csv("data/population_total.csv", check.names = FALSE)
In the datasets, the name of the continent for each country is missing. We can get it by importing another file with the countries and contients or using a package called countrycode.
# Adding the continent to the life expectancy table using countrycode package
lif$continent <- countrycode(sourcevar = lif[, "country"],
origin = "country.name",
destination = "continent")
Because the datasets are breaking the rules of tidy data, we need to pivot longer all 3 tables so we can use this data properly.
life_expectancy_long <- lif %>%
# This columns will stay with the new variables
pivot_longer(cols = c(-country, -continent),
# This will create the variables year and life_expectancy
names_to = "year",
values_to = "life_expectancy")
income_long <- inc %>%
pivot_longer(-country,
names_to = "year",
values_to = "income")
population_long <- pop %>%
pivot_longer(-country,
names_to = "year",
values_to = "population")
Because the income and population datasets include projections, I’m using the life expectancy table which has the latest data without projections.
# Using a left join to keep the data of the main table life_expectancy_long
allcountries <-
life_expectancy_long %>%
# This columns are the
left_join(income_long, by = c("country", "year"))
allcountries <- allcountries %>%
left_join(population_long, by = c("country", "year"))
# Checking the first rows to see the joined data.
head(allcountries)
## # A tibble: 6 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Afghanistan Asia 1800 28.2 603 3280000
## 2 Afghanistan Asia 1801 28.2 603 3280000
## 3 Afghanistan Asia 1802 28.2 603 3280000
## 4 Afghanistan Asia 1803 28.2 603 3280000
## 5 Afghanistan Asia 1804 28.2 603 3280000
## 6 Afghanistan Asia 1805 28.2 603 3280000
Exporting the clean dataset as a CSV can be useful for later manipulation and save all the data importing and cleaning steps above.
write_csv(allcountries, "clean_df.csv")
For the static plots I decided to go with the years 1800, 1950 and 2018 to compare how the variables income and life expectancy changed over this years.
# Assigning a name to the plot
chart1 <- allcountries %>%
# Filtering the year of interest
filter(year %in% c("1800")) %>%
# Assigning the variables to the axes and coloring the distinct continents
ggplot(aes(x = income, y = life_expectancy, colour = continent)) +
# Adding the plots where the bigger the population, the bigger the plots.
# Using alpha to add transparency.
geom_point(aes(size = population), alpha = .5) +
# Adding the limits and breaks for the y axis.
# This scale is continuous because age does not change that much overtime.
scale_y_continuous(
limits = c(10, 95),
breaks = c(25, 50, 75)) +
# Adding the limits and breaks for the x axis.
# This scale is logarithmic to better represent the change over time.
scale_x_log10(
limits = c(400, 120000),
breaks = c(400, 4000, 40000))
# Displaying the base plot.
chart1
# Repeating the same process for the years 1950 and 2018.
chart2 <- allcountries %>%
filter(year %in% c("1950")) %>%
ggplot(aes(x = income, y = life_expectancy, colour = continent)) +
geom_point(aes(size = population), alpha = .5) +
scale_y_continuous(
limits = c(10, 95),
breaks = c(25, 50, 75)) +
scale_x_log10(
limits = c(400, 120000),
breaks = c(400, 4000, 40000))
chart2
chart3 <- allcountries %>%
filter(year %in% c("2018")) %>%
ggplot(aes(x = income, y = life_expectancy, colour = continent)) +
geom_point(aes(size = population), alpha = .5) +
scale_y_continuous(
limits = c(10, 95),
breaks = c(25, 50, 75)) +
scale_x_log10(
limits = c(400, 120000),
breaks = c(400, 4000, 40000))
chart3
Finding interesting data points to show in our plots, including maximum and minimum values for income and life expectancy.
# For the year 1800
allcountries %>%
filter(year %in% c("1800")) %>%
# To see the country with highest life expectancy.
slice_max(life_expectancy)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Iceland Europe 1800 42.9 926 61400
allcountries %>%
filter(year %in% c("1800")) %>%
# To see the country with highest income.
slice_max(income)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Netherlands Europe 1800 39.9 4230 2250000
# For the year 1950
allcountries %>%
filter(year %in% c("1950")) %>%
slice_max(life_expectancy)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Norway Europe 1950 71.6 11400 3270000
allcountries %>%
filter(year %in% c("1950")) %>%
slice_max(income)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Brunei Asia 1950 55.5 56600 48000
allcountries %>%
filter(year %in% c("1950")) %>%
slice_min(life_expectancy)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Yemen Asia 1950 23.8 1340 4400000
# For the year 2018
allcountries %>%
filter(year %in% c("2018")) %>%
slice_max(life_expectancy)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Japan Asia 2018 84.2 39100 127000000
allcountries %>%
filter(year %in% c("2018")) %>%
slice_min(life_expectancy)
## # A tibble: 1 x 6
## country continent year life_expectancy income population
## <chr> <chr> <chr> <dbl> <int> <int>
## 1 Lesotho Africa 2018 51.1 2960 2260000
To make the plots more attractive, we need to use the right formatting. In the code below I’m adding themes, changing colors, adding labels and reference lines.
# Saving the final plots
lifeexp1800 <- chart1 +
# Adding labels to show information about the plot, like title, labels.
labs(title = "Income versus Life Expectancy in 1800",
x = "Income (GDP per capita in USD $)",
y = "Life expectancy (years)",
caption = "Source: Gapminder",
size = "Population (millions)",
color = "Continent") +
# Setting the size of the bubbles with the same scale for the 3 plots
scale_size(
# The range is the size of the bubbles, higher would mean bigger difference between bubbles.
range = c(0.1, 15),
# The limits of population from NA to the highest which is China just under 1500 million.
limits = c(NA,1500000000),
# Multiplying the population by millions so it would be easier to read.
breaks = 1000000 * c(10, 50, 100, 500, 1000, 1500),
labels = c("10", "50", "100", "500", "1000", "1500")) +
# Changing the color palette
scale_colour_brewer(palette = "Set1") +
# Changing the default theme
theme_classic() +
# Using the override function to make the color label bigger
guides(color = guide_legend(override.aes = list(size = 5, alpha = .5))) +
# Adding the vertical and horizontal lines to replicate Hans Rosling's plots
geom_vline(xintercept = c(400, 4000, 40000),
# Adding a light color, with small size and .5 transparency so it is not distracting.
color = "grey", size = .2, alpha = .5) +
# Doing the same on the y axes.
geom_hline(yintercept = c(25, 50, 75),
color = "grey", size = .2, alpha = .5) +
# Labeling the countries with the interesting points we found earlier.
# The life_expectancy has a +5 to show the label higher on the y axes.
geom_text(aes(x = income, y = life_expectancy + 5, label = country),
color = "grey50",
# Filtering the data to show only the 2 countries we want.
data = filter(allcountries, year == 1800, country %in% c("Iceland", "Netherlands")))
# Showing the result
lifeexp1800
# Repeating the same process for the next 2 plots.
lifeexp1950 <- chart2 +
labs(title = "Income versus Life Expectancy in 1950",
x = "Income (GDP per capita in USD $)",
y = "Life expectancy (years)",
caption = "Source: Gapminder",
size = "Population (millions)",
color = "Continent") +
scale_size(
range = c(0.1, 15),
limits = c(NA,1500000000),
breaks = 1000000 * c(10, 50, 100, 500, 1000, 1500),
labels = c("10", "50", "100", "500", "1000", "1500")) +
scale_colour_brewer(palette = "Set1") +
theme_classic() +
guides(color = guide_legend(override.aes = list(size = 5))) +
geom_vline(xintercept = c(400, 4000, 40000),
color = "grey", size = .2, alpha = .5) +
geom_hline(yintercept = c(25, 50, 75),
color = "grey", size = .2, alpha = .5) +
geom_text(aes(x = income, y = life_expectancy + 5, label = country),
color = "grey50",
data = filter(allcountries, year == 1950, country %in% c("Norway", "Brunei", "Yemen")))
lifeexp1950
lifeexp2018 <- chart3 +
labs(title = "Income versus Life Expectancy in 2018",
x = "Income (GDP per capita in USD $)",
y = "Life expectancy (years)",
caption = "Source: Gapminder",
size = "Population (millions)",
color = "Continent") +
scale_size(
range = c(0.1, 15),
limits = c(NA,1500000000),
breaks = 1000000 * c(10, 50, 100, 500, 1000, 1500),
labels = c("10", "50", "100", "500", "1000", "1500")) +
scale_colour_brewer(palette = "Set1") +
theme_classic() +
guides(color = guide_legend(override.aes = list(size = 5))) +
geom_vline(xintercept = c(400, 4000, 40000),
color = "grey", size = .2, alpha = .5) +
geom_hline(yintercept = c(25, 50, 75),
color = "grey", size = .2, alpha = .5) +
geom_text(aes(x = income, y = life_expectancy + 5, label = country),
color = "grey50",
data = filter(allcountries, year == 2018, country %in% c("Japan", "Lesotho")))
lifeexp2018
Saving the plots as images can be useful for sharing.
# Adding the name and image extension of the file you want.
ggsave("lifeexp1800.jpg",
# Selecting the plot. you want to save.
lifeexp1800,
# Specifying the aspect ratio.
width = 16, height = 9)
# Repeating the process for the next plots.
ggsave("lifeexp1950.jpg",
lifeexp1950,
width = 16, height = 9)
ggsave("lifeexp2018.jpg",
lifeexp2018,
width = 16, height = 9)
It’s time to try and recreate the animation like Hans Rosling’s video. In this animation we’re going to see how the life expectancy and income have changed over the last 50 years.
# Creating a separate dataset for the animation
anim_data <- allcountries %>%
# Changing variable type to integer, because it was a chr and it has to be numeric.
mutate(year= as.integer(year)) %>%
# Setting the range of years for the animation
filter(year %in% (1968:2018))
# Adding the same format as our previous plots.
anim_output <- ggplot(anim_data, aes(income, life_expectancy, size = population, color = continent, frame = year)) +
labs(x="Income (GDP per capita in USD $)",
y = "Life Expectancy (years)",
caption = "Source: Gapminder",
size = "Population (millions)",
color = 'Continent') +
scale_y_continuous(
limits = c(10, 95),
breaks = c(25, 50, 75)) +
scale_x_log10(
limits = c(400, 120000),
breaks = c(400, 4000, 40000)) +
scale_colour_brewer(palette = "Set1") +
theme_classic() +
guides(color = guide_legend(override.aes = list(size = 5))) +
geom_vline(xintercept = c(400, 4000, 40000),
color = "grey", size = .2, alpha = .5) +
geom_hline(yintercept = c(25, 50, 75),
color = "grey", size = .2, alpha = .5) +
geom_point(aes(), alpha = .5) +
scale_size(
range = c(0.1, 15),
limits = c(NA,1500000000),
breaks = 1000000 * c(10, 50, 100, 500, 1000, 1500),
labels = c("10", "50", "100", "500", "1000", "1500")) +
# gganimate parameters, title with year changing with the plots.
ggtitle("Income versus Life Expectancy, year: {frame_time}") +
transition_time(year) +
ease_aes("linear") +
enter_fade() +
exit_fade()
# Animation parameter, duration, frames per second, size and using the gifski renderer for the output..
animate(anim_output, duration = 10, fps = 20, width = 800, height = 400, renderer = gifski_renderer())
# Saving the animation as a GIF
anim_save("capstone_animation.gif")
To show more of the plotting capabilities in R, I decided to add an interactive visualization using Plotly. Hovering the mouse over each dot shows each country’s information from out dataset, we also have the ability to zoom in/out, select and export.
# Filtering the data to use the latest year
int_data <- allcountries %>%
filter(year=="2018")
# Interactive version, using mutate to create new variables to show in the tooltip.
interactive <- int_data %>%
# Mutating and rounding the income to 0 decimals.
mutate(income=round(income,0)) %>%
# Mutating and dividing the population so it would be easier to read.
mutate(population=round(population/1000000,2)) %>%
# Mutating life expectancy and rounding with 1 decimal.
mutate(life_expectancy=round(life_expectancy,1)) %>%
# Reordering the countries
arrange(desc(population)) %>%
mutate(country = factor(country, country)) %>%
# Text for tooltip
mutate(text = paste("Country: ", country, "\nPopulation (M): ", population, "\nLife Expectancy: ", life_expectancy, "\nIncome: ", income, sep="")) %>%
# Creating the plot
ggplot( aes(x=income, y=life_expectancy, size = population, color = continent, text=text)) +
geom_point(aes(size = population), alpha = .5) +
scale_y_continuous(
limits = c(50, 90),
breaks = c(50, 60, 70, 80, 90)) +
scale_x_log10(
limits = c(400, 120000),
breaks = c(400, 4000, 40000)) +
scale_colour_brewer(palette = "Set1") +
theme_classic() +
theme(legend.position="none") +
labs(title = "Income versus Life Expectancy in 2018",
x = "Income (GDP per capita in USD $)",
y = "Life expectancy (years)",
caption = "Source: Gapminder")
# Using plotly to make the plot interactive and show each country information with the mouse over.
int2018 <- ggplotly(interactive, tooltip="text")
int2018
Using the lm() method:
gapminder_model <- lm(income ~ life_expectancy, data = allcountries)
gapminder_model
##
## Call:
## lm(formula = income ~ life_expectancy, data = allcountries)
##
## Coefficients:
## (Intercept) life_expectancy
## -10895.1 359.7
Summary of the statistical model:
summary(gapminder_model)
##
## Call:
## lm(formula = income ~ life_expectancy, data = allcountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13981 -2717 11 1335 163428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10895.085 117.463 -92.75 <2e-16 ***
## life_expectancy 359.708 2.547 141.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8356 on 40435 degrees of freedom
## (516 observations deleted due to missingness)
## Multiple R-squared: 0.3303, Adjusted R-squared: 0.3303
## F-statistic: 1.994e+04 on 1 and 40435 DF, p-value: < 2.2e-16
With this summary, we can conclude that the model is statistically significant because it has a small P value. However, R-squared shows that there’s a weak correlation (0.33) between income and life expectancy. We can’t use this model to accurately predict life expectancy in the future because there are other factors in affecting life expectancy.